{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "dcc9ef1a",
   "metadata": {},
   "source": [
    "**<font color = black size=6>实验五:决策树</font>**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba67a4f6",
   "metadata": {},
   "source": [
    "**<font color = blue size=4>第一部分:函数介绍</font>**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "6ac93628",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import math"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9cfffb4e",
   "metadata": {},
   "source": [
    "1)Counter类的使用"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "85c3787c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Counter({133: 2, 132: 1})\n",
      "0\n",
      "2\n",
      "dict_values([2, 1])\n",
      "[133, 132]\n"
     ]
    }
   ],
   "source": [
    "x=np.array([[0,133,1],[0,132,0],[0,133,0]])\n",
    "#使用Counter类对数组第2列进行遍历\n",
    "counter = Counter(x[:,1])\n",
    "#第2列中有1个132和2个133，输出该counter对象可以统计这列的数值情况，便于之后的统计\n",
    "print(counter)\n",
    "#因为第2列中没有为0的值，所以返回0\n",
    "print(counter[0])\n",
    "#因为第2列中有2个133，所以返回2\n",
    "print(counter[133])\n",
    "#一般的字典操作方法都能在该类中使用，例如可以通过values函数返回该列的非重复值的个数，方便对某列的非重复值的个数进行查看\n",
    "print(counter.values())\n",
    "#可以输出所有非重复值\n",
    "print(list(counter))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dfc27d84",
   "metadata": {},
   "source": [
    "2)使用numpy中的unique实现非重复值的提取"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "1a954859",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[  0   1 132 133] [132 133]\n"
     ]
    }
   ],
   "source": [
    "x=np.array([[0,133,1],[0,132,0],[0,133,0]])\n",
    "a=np.unique(x[:])\n",
    "a1=np.unique(x[:,1])\n",
    "print(a,a1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "95b28b2f",
   "metadata": {},
   "source": [
    "**<font color = blue size=4>第二部分:实验任务</font>**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91631e59",
   "metadata": {},
   "source": [
    "**<font color = blue size=3>1) 任务一</font>**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a8df3a6",
   "metadata": {},
   "source": [
    "任务一要求完成不同量化标准(信息增益, 信息增益率, 基尼系数)下的结点划分函数"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7cbe46fb",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">该数据集(train_titanic.csv)为分类数据集，为泰坦尼克号的部分乘客信息以及最后是否生还。包括了四个属性特征以及一个标签(即为Survived,代表是否生还),属性特征分别为Sex(性别)，sibsp(堂兄弟妹个数)，Parch(父母与小孩的个数)，Pclass(乘客等级)  \n",
    "其中该数据集无缺失值和异常值，且所有连续变量已自动转换为离散变量,标签(Survived)也自动转变为离散变量0和1，所以你无需进行数据预处理，可以直接使用该数据集。</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "1dc6ccdb",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import math"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9570e618",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">1) 使用pandas库将训练数据集'train_titanic.csv'载入到Dataframe对象中</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "a1b1263e",
   "metadata": {},
   "outputs": [],
   "source": [
    "#----your code here\n",
    "df = pd.read_csv(\"train_titanic.csv\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "09f3bc41",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">2) 编写函数，给定任何标签数组计算其信息熵  \n",
    "    输入：标签数组  \n",
    "    输出：该数组对应的信息熵  \n",
    "    计算信息熵公式:\n",
    "某数组包含K个不同的取值，样本为第k(k=1,2,...,K)个值的数量所占比例为p_k,则其信息熵为$$Ent=-\\sum_{k=1}^K p_k log_2 p_k$$</span>\n",
    "    \n",
    "    \n",
    "<span style=\"color:purple\">例:  \n",
    "    输入:[[0],[1]]   \n",
    "    输出:(-$\\frac{1}{2}$ $log_2$ $\\frac{1}{2}$)+(-$\\frac{1}{2}$ $log_2$ $\\frac{1}{2}$)=1</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "eaa569aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "def entropy(label):\n",
    "    k = len(label)\n",
    "    # print(k)\n",
    "    print(label)\n",
    "    counter = Counter(label[:, 0])\n",
    "    ent = -sum([(counter[i] / k) * math.log((counter[i] / k), 2) for i in counter])\n",
    "    return ent\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3d77ab84",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">3) 编写函数，函数功能为将所给的数据集按照指定维度的特征进行划分为若干个不同的数据集  \n",
    "    【输入】：特征集合，标签集合，指定维度  \n",
    "    【输出】：划分后所得到的子树属性集合，子树标记集合</span>\n",
    "\n",
    "<span style=\"color:purple\">例子:  \n",
    "    【输入】:特征集合:[[0,0,0],[0,0,1],[1,0,2]]  \n",
    "    标签集合:[[0],[1],[2]]  \n",
    "    指定维度:0</span>  \n",
    "    \n",
    "<span style=\"color:purple\">【输出】:[[0,0,0],[0,0,1]]和[[1,0,2]]  \n",
    "    [[0],[1]]和[[2]]  \n",
    "    tips:即将特征按其第0维度进行划分，由于第0维特征有0和1两个不同的数值，所以特征集合划分为[[0,0,0],[0,0,1]]和[[1,0,2]]，标签集合划分为[[0],[1]]和[[2]]</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "8f1460ce",
   "metadata": {},
   "outputs": [],
   "source": [
    "def split(feature, label, d):\n",
    "    partitioned_feature = {}\n",
    "    partitioned_label = {}\n",
    "    for i in range(len(feature)):\n",
    "        if feature[i][d] not in partitioned_feature:\n",
    "            partitioned_feature[feature[i][d]] = []\n",
    "            partitioned_label[feature[i][d]] = []\n",
    "        partitioned_feature[feature[i][d]].append(feature[i])\n",
    "        partitioned_label[feature[i][d]].append(label[i])\n",
    "\n",
    "    split_feature = []\n",
    "    split_label = []\n",
    "    # split_feature = [\n",
    "    #     [item.tolist() for item in sublist] for sublist in partitioned_feature\n",
    "    # ]\n",
    "    # split_label = [[item.tolist() for item in sublist] for sublist in partitioned_label]\n",
    "    for i in partitioned_feature:\n",
    "        split_feature.append(partitioned_feature[i])\n",
    "        split_label.append(partitioned_label[i])\n",
    "    features_without_array = [\n",
    "        [item.tolist() for item in sublist] for sublist in split_feature\n",
    "    ]\n",
    "    labels_without_array = [\n",
    "        [item.tolist() for item in sublist] for sublist in split_label\n",
    "    ]\n",
    "    return features_without_array, labels_without_array\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47b0230f",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">4) 编写函数，函数功能为进行【一次】决策树的结点划分，遍历找出该特征集合中信息增益(使用ID3算法中的公式计算)【最大】的特征  \n",
    "    输入：特征集合，标签集合  \n",
    "    输出：该次划分的最佳信息增益值，最佳划分维度  \n",
    "    计算信息增益公式:  \n",
    "    某数据集D有若干特征值以及对应的标签值，其总样本大小为|D|,这里取其中一个特征feature,该特征包含V个不同的取值，其中值为第v(v=1,2,...,V)个值的样本数量为|$D^v$|$(\\sum_{v=1}^VD^v=|D|)$,则该特征值对应的信息增益为$$Gain(D,feature)=Ent(D)-\\sum_{v=1}^K \\frac{|D^v|}{D} Ent(D^v)$$\n",
    "该函数中你需要计算出每个特征的信息增益并输出，然后返回最佳的信息增益值和对应的特征的维数</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "242439cb",
   "metadata": {},
   "outputs": [],
   "source": [
    "def one_split_ID3(x_data, y_label):\n",
    "    feature_len = len(x_data[0])\n",
    "    Ent_D = entropy(y_label)\n",
    "    Gain_D = [0] * feature_len\n",
    "    for i in range(feature_len):\n",
    "        split_feature, split_label = split(x_data, y_label, i)\n",
    "        Gain_D[i] = Ent_D - sum(\n",
    "            [\n",
    "                (len(split_label[j]) / len(y_label)) * entropy(split_label[j])\n",
    "                for j in range(len(split_label))\n",
    "            ]\n",
    "        )\n",
    "    best_entropy = max(Gain_D)\n",
    "    best_dimension = Gain_D.index(max(Gain_D))\n",
    "    return best_entropy, best_dimension"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "75f8593e",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">5) 编写函数，函数功能为进行【一次】决策树的结点划分，遍历找出该特征集合中信息增益率(使用C4.5算法中的公式计算)【最大】的特征  \n",
    "    输入：特征集合，标签集合  \n",
    "    输出：最佳划分的信息增益率值，对应的划分维度  \n",
    "    计算信息增益率公式:  \n",
    "    某数据集D有若干特征值以及对应的标签值，其总样本大小为|D|,这里取其中一个特征类型feature,该特征包含V个不同的取值，值为第v(v=1,2,...,V)个值的样本数量为|$D^v$|$(\\sum_{v=1}^V|D^v|=|D|)$,该特征本身的信息熵为Ent(feature),则该特征值对应的信息增益率为$$Gain\\_ratio(D,feature)=\\frac{Gain(D,feature)}{Ent(feature)}$$\n",
    "该函数中你需要输出每个特征的信息增益率，之后返回最佳的信息增益率和对应的特征维数</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "91d51379",
   "metadata": {},
   "outputs": [],
   "source": [
    "def one_split_C4_5(x_data, y_label):\n",
    "    feature_len = len(x_data[0])\n",
    "    Gain_ratio = [0] * feature_len\n",
    "    Ent_D = entropy(y_label)\n",
    "    Gain_D = [0] * feature_len\n",
    "    for i in range(feature_len):\n",
    "        split_feature, split_label = split(x_data, y_label, i)\n",
    "        Gain_D[i] = Ent_D - sum(\n",
    "            [\n",
    "                (len(split_label[j]) / len(y_label)) * entropy(split_label[j])\n",
    "                for j in range(len(split_label))\n",
    "            ]\n",
    "        )\n",
    "        Gain_ratio[i] = Gain_D[i] / entropy(x_data[:, i])\n",
    "    best_entropy = max(Gain_ratio)\n",
    "    best_dimension = Gain_ratio.index(max(Gain_ratio))\n",
    "    return best_entropy, best_dimension"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee40f45a",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">6) 编写函数，进行【一次】决策树的结点划分，遍历找出该特征集合中基尼系数(使用CART算法中的公式计算)最小的特征以及最佳的划分值  \n",
    "    输入：特征集合，标签集合  \n",
    "    输出：最佳的基尼系数，对应的划分维度，最佳划分值  \n",
    "    计算基尼系数公式:  \n",
    "    某数据集D有若干特征值以及对应的标签值，其总样本大小为|D|,该集合中标签类别总共有K类，第k类样本所占比例为$p_k$(k=1,2,..,K),则该数据集对应的基尼系数为$$Gini(D)=1-\\sum_{k=1}^K{p_k}^2$$  \n",
    "    而取其中一个特征feature，选定该特征的一个值value，根据该特征的值是否为value将数据集分为两个子集$D_1$和$D_2$,则该特征对应的基尼系数为$$Gini\\_index(D,feature)=\\frac{|D_1|}{|D|}Gini(D_1)+\\frac{|D_2|}{|D|}Gini(D_2)$$\n",
    "该函数中你需要遍历每一列特征，找出每列中的非重复值，计算出每个非重复值的基尼系数，返回最小的基尼系数、对应的特征维数和非重复值（分类值）。</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "d073721e",
   "metadata": {},
   "outputs": [],
   "source": [
    "def one_split_CART(x_data, y_label):\n",
    "    \n",
    "    return best_entropy, best_dimension,best_value"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4bdc9f0",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">7) 应用之前你在第4、5、6个部分编写的三个函数，在训练数据集'train_titanic.csv'上依次使用这些函数进行【一次】结点划分，并输出对应的最佳特征维数以及相应的信息增益值/信息增益率/(基尼系数和分类值)</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "8769cb89",
   "metadata": {},
   "outputs": [],
   "source": [
    "#----your code here\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df259d58",
   "metadata": {},
   "source": [
    "**<font color = blue size=3>2) 任务二</font>**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c5fe4ac",
   "metadata": {},
   "source": [
    "任务二承接任务一，用【ID3】算法实现一棵完整的决策树。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f67d941",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">1) 整理任务一的代码，根据实验二要求做适当修改来编写函数，【从剩余的特征集A中】寻找使得信息增益最大的特征   \n",
    "    输入：数据集D、剩余的特征集A    \n",
    "    输出：最佳划分的特征维数    \n",
    "    计算信息增益公式:  \n",
    "    某数据集D有若干特征值以及对应的标签值，其总样本大小为|D|,这里取其中一个特征feature,该特征包含V个不同的取值，特征值为第v(v=1,2,...,V)个值的样本数量为|$D^v$|$(\\sum_{v=1}^VD^v=|D|)$,则该特征对应的信息增益为$$Gain(D,feature)=Ent(D)-\\sum_{v=1}^K \\frac{|D^v|}{D} Ent(D^v)$$</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "bf436ee0",
   "metadata": {},
   "outputs": [],
   "source": [
    "def best_split(D, A):\n",
    "    \n",
    "    return best_dimension"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "68ceb0dc",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">2) 完成DTree类中的TreeGenerate、train函数以完成决策树的构建。并完成DTree类中的predict函数来用构建好的决策树来对测试数据集进行预测并输出预测准确率。</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "d9473e3b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 树结点类\n",
    "class Node:\n",
    "    def __init__(self, isLeaf=True, label=-1, feature_index=-1):\n",
    "        self.isLeaf = isLeaf # isLeaf表示该结点是否是叶结点\n",
    "        self.label = label # label表示该叶结点的label（当结点为叶结点时有用）\n",
    "        self.feature_index = feature_index # feature_index表示该分支结点的划分特征的序号（当结点为分支结点时有用）\n",
    "        self.children = {} # children表示该结点的所有孩子结点，dict类型，方便进行决策树的搜索\n",
    "        \n",
    "    def addNode(self, val, node):\n",
    "        self.children[val] = node #为当前结点增加一个划分特征的值为val的孩子结点"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "12a28664",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 决策树类\n",
    "class DTree:\n",
    "    def __init__(self):\n",
    "        self.tree_root = None #决策树的根结点\n",
    "        self.possible_value = {} # 用于存储每个特征可能的取值\n",
    "    \n",
    "        \n",
    " \n",
    "    '''\n",
    "    TreeGenerate函数用于递归构建决策树，伪代码参照课件中的“Algorithm 1 决策树学习基本算法”\n",
    "     \n",
    "    '''    \n",
    "    \n",
    "    def TreeGenerate(self, D, A):\n",
    "        \n",
    "        # 生成结点 node\n",
    "        \n",
    "        node = Node()\n",
    "\n",
    "        \n",
    "        # if D中样本全属于同一类别C then\n",
    "        #     将node标记为C类叶结点并返回\n",
    "        # end if\n",
    "        \n",
    "       \n",
    "        \n",
    "        \n",
    "        # if A = Ø OR D中样本在A上取值相同 then\n",
    "        #     将node标记叶结点，其类别标记为D中样本数最多的类并返回\n",
    "        # end if\n",
    "        \n",
    "        \n",
    "        \n",
    "        \n",
    "        # 从A中选择最优划分特征a_star\n",
    "        # （选择信息增益最大的特征，用到上面实现的best_split函数） \n",
    "        \n",
    "        \n",
    "        # for a_star 的每一个值a_star_v do\n",
    "        #     为node 生成每一个分支；令D_v表示D中在a_star上取值为a_star_v的样本子集\n",
    "        #     if D_v 为空 then\n",
    "        #         将分支结点标记为叶结点，其类别标记为D中样本最多的类\n",
    "        #     else\n",
    "        #         以TreeGenerate(D_v,A-{a_star}) 为分支结点\n",
    "        #     end if\n",
    "        # end for\n",
    "            \n",
    "    \n",
    "    \n",
    "    \n",
    " \n",
    "    '''\n",
    "    train函数可以做一些数据预处理（比如Dataframe到numpy矩阵的转换，提取属性集等），并调用TreeGenerate函数来递归地生成决策树\n",
    " \n",
    "    '''\n",
    "    \n",
    "    def train(self, D):\n",
    "#         D = np.array(D) # 将Dataframe对象转换为numpy矩阵（也可以不转，自行决定做法）\n",
    "#         A = set(range(D.shape[1] - 1)) # 特征集A\n",
    "        \n",
    "#         #记下每个特征可能的取值\n",
    "#         for every in A:\n",
    "#             self.possible_value[every] = np.unique(D[:, every])\n",
    "        \n",
    "#         self.tree_root = self.TreeGenerate(D, A) # 递归地生成决策树，并将决策树的根结点赋值给self.tree_root\n",
    "        return\n",
    "\n",
    "    \n",
    "    '''\n",
    "    predict函数对测试集D进行预测， 并输出预测准确率（预测正确的个数 / 总数据数量）\n",
    "    \n",
    "    '''\n",
    "    \n",
    "    def predict(self, D):\n",
    "        \n",
    "#         D = np.array(D) # 将Dataframe对象转换为numpy矩阵（也可以不转，自行决定做法）      \n",
    "#         #对于D中的每一行数据d，从当前结点x=self.tree_root开始，当当前结点x为分支结点时，\n",
    "\n",
    "#         #则搜索x的划分特征为该行数据相应的特征值的孩子结点（即x=x.children[d[x.index]]），不断重复，\n",
    "#         #直至搜索到叶结点，该叶结点的标签就是数据d的预测标签\n",
    "        return acc\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5d647f2b",
   "metadata": {},
   "outputs": [],
   "source": [
    "#----your code here\n",
    "\n",
    "# 构建决策树\n",
    "\n",
    "\n",
    "# 利用构建好的决策树对测试数据集进行预测，输出预测准确率（预测正确的个数 / 总数据数量）\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b1e214a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "#展示生成的决策树结构\n",
    "def display_tree(node, indent=''):\n",
    "    if node.isLeaf:\n",
    "        print(indent + \"Leaf Node: label =\", node.label)\n",
    "    else:\n",
    "        print(indent + \"Branch Node: feature_index =\", node.feature_index)\n",
    "        for value, child in node.children.items():\n",
    "            print(indent + \"|--> Child Value:\", value)\n",
    "            display_tree(child, indent + \"   \")\n",
    "\n",
    "\n",
    "display_tree()\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd7876bb",
   "metadata": {},
   "source": [
    "**<font color = blue size=4>第三部分:作业提交</font>**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "82677693",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">本次实验分两周完成,实验报告提交地址(学号+姓名+实验五):https://send2me.cn/s0Yr-w4z/QCKd4ibz34miqQ   截止日期:10.20 14:20</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "95691245",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">实验课件获取地址:https://www.jianguoyun.com/p/DZRe9cwQp5WhChjz4p8FIAA</span>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
